A theoretical, and potentially practical, issue with stochastic gradient descent is that trajectories may escape to infinity. In this note, we investigate uniform boundedness of the iterates and function values along the trajectories of stochastic gradient descent (SGD) and its important momentum variant. Under smoothness and $R$-dissipativity of the loss function, we show that a broad class of step sizes, including the widely used step-decay and cosine (with or without restarts) schedules, results in uniformly bounded iterates and function values. Several important applications satisfying these assumptions are discussed in detail, including phase retrieval problems, Gaussian mixture models, and some neural network classifiers. We further extend the uniform boundedness of SGD and its momentum variant to generalized dissipativity, covering functions whose tails grow more slowly than quadratic functions. This includes some interesting applications, for example, Bayesian logistic regression and logistic regression with $\ell_1$ regularization.
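For reference, a minimal sketch of the two step-size schedules named above (step-decay and cosine with restarts); the hyperparameter values are illustrative and not taken from the paper.

```python
import math

def step_decay_lr(t, lr0=0.1, factor=0.5, every=100):
    """Step-decay schedule: multiply the step size by `factor` every `every` iterations."""
    return lr0 * factor ** (t // every)

def cosine_restart_lr(t, lr0=0.1, period=100):
    """Cosine schedule with restarts: follow a cosine curve within each period,
    then reset to the initial step size at the start of the next period."""
    phase = (t % period) / period
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * phase))
```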
Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace. Several families of communication-reduction methods, such as quantization, large-batch methods, and gradient sparsification, have been proposed. To date, gradient sparsification methods, in which each node sorts gradients by magnitude and only communicates a subset of the components while accumulating the rest locally, are known to yield some of the largest practical gains. Such methods can reduce the amount of communication per step by up to three orders of magnitude, while preserving model accuracy. Yet, this family of methods currently has no theoretical justification. This is the question we address in this paper. We prove that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD. The main insight is that sparsification methods implicitly maintain bounds on the maximum impact of stale updates, thanks to selection by magnitude. Our analysis and empirical validation also reveal that these methods do require analytical conditions to converge well, justifying existing heuristics.
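A minimal per-node sketch of the magnitude-based sparsification with local error accumulation described above; the parameter `k` and the use of NumPy are illustrative choices, not details taken from the paper.

```python
import numpy as np

def sparsify_with_memory(grad, memory, k):
    """One worker's step: add the locally accumulated residual, keep the k
    largest-magnitude components for communication, and store the rest."""
    corrected = grad + memory                          # apply local error correction
    idx = np.argpartition(np.abs(corrected), -k)[-k:]  # indices of the top-k magnitudes
    sparse = np.zeros_like(corrected)
    sparse[idx] = corrected[idx]                       # components to communicate
    return sparse, corrected - sparse                  # residual stays local
```

In data-parallel SGD, each worker would transmit only `sparse` (in a compressed index-value format), the aggregated result would be applied as the update, and the returned residual would carry the uncommunicated mass into the next step.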
In this paper we take the first steps in studying a new approach to the synthesis of efficient communication schemes in multi-agent systems, trained via reinforcement learning. We combine symbolic methods with machine learning, in what is referred to as a neuro-symbolic system. The agents are not restricted to using only the initial primitives: reinforcement learning is interleaved with steps that extend the current language with novel higher-level concepts, allowing generalisation and more informative communication via shorter messages. We demonstrate that this approach allows agents to converge more quickly on a small collaborative construction task.
To date, no "information-theoretic" frameworks for reasoning about generalization error have been shown to establish minimax rates for gradient descent in the setting of stochastic convex optimization. In this work, we consider the prospect of establishing such rates via several existing information-theoretic frameworks: input-output mutual information bounds, conditional mutual information bounds and variants, PAC-Bayes bounds, and recent conditional variants thereof. We prove that none of these bounds are able to establish minimax rates. We then consider a common tactic employed in studying gradient methods, whereby the final iterate is corrupted by Gaussian noise, producing a noisy "surrogate" algorithm. We prove that minimax rates cannot be established via the analysis of such surrogates. Our results suggest that new ideas are required to analyze gradient descent using information-theoretic techniques.
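For context, a canonical instance of the first framework listed above is the input-output mutual information bound of Xu and Raginsky: if the loss is $\sigma$-sub-Gaussian under the data distribution, an algorithm mapping an $n$-sample dataset $S$ to an output $W$ satisfies $|\mathbb{E}[\mathrm{gen}(W,S)]| \le \sqrt{2\sigma^2 I(W;S)/n}$, where $\mathrm{gen}(W,S)$ is the gap between population and empirical risk and $I(W;S)$ is the mutual information between the output and the training sample.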
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on either multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
Despite decades of research, existing navigation systems still face real-world challenges when deployed in the wild, such as in cluttered home environments or human-occupied public spaces. To address this, we present a new class of implicit control policies that combine the benefits of imitation learning with the robust handling of system constraints offered by model predictive control (MPC). Our approach, called Performer-MPC, uses a learned cost function parameterized by vision-context embeddings provided by Performers, a class of low-rank implicit-attention Transformers. We jointly train the cost function and construct the controller that relies on it, effectively solving the corresponding bi-level optimization problem end to end. We show that the resulting policy improves standard MPC performance by leveraging a few expert demonstrations in a range of challenging real-world scenarios. Compared with a standard MPC policy, Performer-MPC achieves 40% better goal reaching in cluttered environments and more than 65% better performance on social metrics when navigating around humans.
The multi-armed bandit (MAB) problem is a simple yet powerful framework that has been extensively studied in the context of decision-making under uncertainty. In many real-world applications, such as robotics, selecting an arm corresponds to a physical action that constrains the choices of the next available arms (actions). Motivated by this, we study an extension of MAB called the graph bandit, in which an agent travels over a graph to maximize the reward collected from different nodes. The graph defines the agent's freedom to select the next available node at each step. We assume the graph structure is fully available, but the reward distributions are unknown. Building on an offline graph-based planning algorithm and the principle of optimism, we design an online learning algorithm that balances long-term exploration and exploitation. We show that our proposed algorithm achieves $O(|S|\sqrt{T}\log(T) + D|S|\log T)$ learning regret, where $|S|$ is the number of nodes and $D$ is the diameter of the graph, which is superior to the best-known reinforcement learning algorithms under similar settings. Numerical experiments confirm that our algorithm outperforms several benchmarks. Finally, we present a synthetic robotic application modeled by the graph bandit framework, in which a robot moves over a network of rural/suburban locations to provide high-speed Internet access using our proposed algorithm.
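An illustrative sketch of the optimism-based idea (maintain per-node confidence indices and move, one hop at a time, toward the most optimistic node); this is a simplification for intuition rather than the paper's exact planner, and the `networkx` dependency and reward interface are assumptions.

```python
import math
import networkx as nx  # assumed dependency for shortest-path planning

def optimistic_graph_walk(G, sample_reward, T, start, c=2.0):
    """Illustrative optimism-based walk on a graph bandit.
    `sample_reward(v)` draws one sample of node v's unknown reward."""
    counts = {v: 0 for v in G}
    means = {v: 0.0 for v in G}
    current, total = start, 0.0
    for t in range(1, T + 1):
        # optimistic (UCB-style) index for every node; unvisited nodes get priority
        ucb = {v: float("inf") if counts[v] == 0
               else means[v] + math.sqrt(c * math.log(t) / counts[v]) for v in G}
        target = max(ucb, key=ucb.get)
        if target != current:
            # move one hop along a shortest path toward the most optimistic node
            current = nx.shortest_path(G, current, target)[1]
        r = sample_reward(current)
        counts[current] += 1
        means[current] += (r - means[current]) / counts[current]
        total += r
    return total
```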
Current language models have been criticized for learning language from text alone, without any connection between words and their meanings. Consequently, multimodal training has been proposed as a way to create models with better language understanding by providing the missing connection. We focus on pre-trained multimodal vision-and-language (VL) models, for which there already exist some results on their language understanding capabilities. An unresolved issue with evaluating the linguistic skills of these models, however, is that there is no established method for adapting them to text-only input without out-of-distribution uncertainty. To find the best approach, we investigate and compare seven possible methods for adapting three different pre-trained VL models to text-only input. Our evaluations on both GLUE and Visual Property Norms (VPN) show that care should be taken when adapting VL models to zero-shot text-only tasks, while the models are less sensitive to how we adapt them to non-zero-shot tasks. We also find that the adaptation methods perform differently for different models, and that unimodal model counterparts perform on par with the VL models regardless of adaptation, indicating that current VL models do not necessarily gain better language understanding from their multimodal training.
In domains where sample sizes are limited, efficient learning algorithms are critical. Learning using privileged information (LUPI) improves sample efficiency by allowing prediction models access, at training time, to types of information that are unavailable when the model is used. In recent work, it was shown that, for prediction in linear-Gaussian dynamical systems, a LUPI learner with access to intermediate time-series data is never worse, and is often better, than any unbiased classical learner. We provide new insights into this analysis and generalize it to nonlinear prediction tasks in latent dynamical systems, extending the theoretical guarantees to the case where the map connecting latent variables and observations is known up to a linear transform. In addition, we propose algorithms based on random features and representation learning for the case in which this map is unknown. A suite of empirical results confirms the theoretical findings and shows the potential of using privileged time-series information in nonlinear prediction.
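As one generic building block behind the random-features approach mentioned above, a minimal random Fourier feature map approximating an RBF kernel; this is standard machinery rather than the paper's full privileged-information learner, and all parameter values are illustrative.

```python
import numpy as np

def random_fourier_features(X, n_features=200, bandwidth=1.0, seed=0):
    """Map rows of X (n_samples x d) to random Fourier features whose inner
    products approximate an RBF kernel with the given bandwidth."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / bandwidth, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
```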
We consider risk-averse learning in repeated unknown games, in which the goal of the agents is to minimize the risk of incurring individually high costs. Specifically, the agents use the conditional value at risk (CVaR) as a risk measure and rely on bandit feedback, in the form of the cost values of the selected actions at each episode, to estimate their CVaR values and update their actions. A major challenge in using bandit feedback to estimate CVaR is that an agent can only access its own cost values, which, however, depend on the actions of all agents. To address this challenge, we propose a new risk-averse learning algorithm that utilizes the full historical information of the cost values. We show that this algorithm achieves sub-linear regret and matches the best-known algorithms in the literature. We provide numerical experiments for a European Masters game, which demonstrate that our approach outperforms existing methods.
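For concreteness, a minimal sketch of the standard empirical CVaR estimator an agent could apply to its historical cost values; the confidence level is illustrative, and the paper's exact estimator may differ.

```python
import numpy as np

def empirical_cvar(costs, alpha=0.95):
    """Empirical conditional value at risk: the mean of the worst
    (1 - alpha) fraction of the observed costs."""
    costs = np.asarray(costs, dtype=float)
    var = np.quantile(costs, alpha)      # empirical value at risk (VaR)
    return costs[costs >= var].mean()    # average of costs at or beyond VaR
```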